Hua XIAO Huai-Zong SHAO Qi-Cong PENG
In this paper, a robust sound source localization approach is proposed. The approach retains good performance even when model errors exist. Compared with previous work in this field, the contributions of this paper are as follows. First, an improved broad-band and near-field array model is proposed. It takes array gain and phase perturbations into account and is based on the actual positions of the elements. It can be used with arrays of arbitrary planar geometry. Second, a subspace model errors estimation algorithm and a Weighted 2-Dimension Multiple Signal Classification (W2D-MUSIC) algorithm are proposed. The subspace model errors estimation algorithm estimates the unknown parameters of the array model, i.e., the gain and phase perturbations and the positions of the elements, with high accuracy. The performance of this algorithm improves as the SNR or the number of snapshots increases. The W2D-MUSIC algorithm, based on the improved array model, is used to locate sound sources. Together, these two algorithms make up the robust sound source localization approach. The approach also provides more accurate steering vectors for further processing such as adaptive beamforming. Numerical examples confirm the effectiveness of the proposed approach.
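For orientation, the subspace idea behind MUSIC can be sketched in a few lines. This is a minimal illustration of plain narrowband, far-field MUSIC on a uniform linear array, not the W2D-MUSIC algorithm or the improved array model of the abstract. The half-wavelength spacing and the function names `steering` and `music_spectrum` are assumptions made for the example.

```python
import numpy as np

def steering(theta_deg, n_mics):
    """Far-field steering vector for a half-wavelength-spaced uniform linear array (assumed geometry)."""
    m = np.arange(n_mics)
    return np.exp(-1j * np.pi * m * np.sin(np.deg2rad(theta_deg)))

def music_spectrum(snapshots, n_sources, grid_deg):
    """MUSIC pseudo-spectrum from snapshots of shape (n_mics, n_snapshots)."""
    n_mics, n_snap = snapshots.shape
    R = snapshots @ snapshots.conj().T / n_snap    # sample covariance matrix
    _, vecs = np.linalg.eigh(R)                     # eigenvalues in ascending order
    noise = vecs[:, : n_mics - n_sources]           # basis of the noise subspace
    spec = []
    for theta in grid_deg:
        a = steering(theta, n_mics)
        # Steering vectors of true sources are nearly orthogonal to the noise subspace,
        # so the reciprocal of the projection energy peaks at the source directions.
        spec.append(1.0 / (np.linalg.norm(noise.conj().T @ a) ** 2 + 1e-12))
    return np.array(spec)
```

The source direction is read off as the grid angle that maximizes the pseudo-spectrum. Gain, phase, or position errors in the assumed steering vectors degrade this peak, which is the kind of model error the abstract sets out to estimate.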
Yusuke HIOKA Kazunori KOBAYASHI Ken'ichi FURUYA Akitoshi KATAOKA
A method for extracting a sound signal from a particular area that is surrounded by multiple ambient noise sources is proposed. This method performs several fixed beamformings on a pair of small microphone arrays separated from each other to estimate the signal and noise power spectra. Noise suppression is achieved by applying spectrum emphasis to the output of fixed beamforming in the frequency domain, which is derived from the estimated power spectra. In experiments performed in a room with reverberation, this method succeeded in suppressing the ambient noise, giving an SNR improvement of more than 10 dB, which is better than the performance of the conventional fixed and adaptive beamforming methods using a large-aperture microphone array. We also confirmed that this method keeps its performance even if the noise source location changes continuously or abruptly.
Kazunori KOBAYASHI Ken'ichi FURUYA Yoichi HANEDA Akitoshi KATAOKA
We previously proposed a method of sound source and microphone localization. The method estimates the locations of sound sources and microphones from only time differences of arrival between signals picked up by microphones even if all their locations are unknown. However, there is a problem that some estimation results converge to local minimum solutions because this method estimates locations iteratively and the error function has multiple minima. In this paper, we present a new iterative method to solve the local minimum problem. This method achieves accurate estimation by selecting effective initial locations from many random initial locations. The computer simulation and experimental results demonstrate that the presented method eliminates most local minimum solutions. Furthermore, the computational complexity of the presented method is similar to that of the previous method.
Yuki DENDA Takanobu NISHIURA Yoichi YAMASHITA
This paper describes a new talker direction estimation method for front-end processing to capture distant-talking speech by using a microphone array. The proposed method consists of two algorithms: One is a TDOA (Time Delay Of Arrival) estimation algorithm based on a weighted CSP (Cross-power Spectrum Phase) analysis with an average speech spectrum and CSP coefficient subtraction. The other is a talker direction estimation algorithm based on ML (Maximum Likelihood) estimation in a time sequence of the estimated TDOAs. To evaluate the effectiveness of the proposed method, talker direction estimation experiments were carried out in an actual office room. The results confirmed that the talker direction estimation performance of the proposed method is superior to that of the conventional methods in both diffused- and directional-noise environments.
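As a rough illustration of the first algorithm's starting point, the sketch below implements basic CSP analysis (also known as GCC-PHAT) for a single microphone pair. It does not include the average-speech-spectrum weighting or the CSP coefficient subtraction described in the abstract, and the function name `csp_tdoa` and its parameters are assumptions made for the example.

```python
import numpy as np

def csp_tdoa(x1, x2, max_lag):
    """Estimate the delay (in samples) of x2 relative to x1 from the CSP coefficients."""
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # keep only the phase (whitening)
    csp = np.fft.irfft(cross, n)          # CSP coefficients indexed by circular lag
    # Reorder so that index 0 corresponds to lag -max_lag and the last to +max_lag.
    csp = np.concatenate((csp[-max_lag:], csp[:max_lag + 1]))
    return int(np.argmax(csp)) - max_lag, csp
```

A positive return value means the signal reaches the second microphone later than the first. Converting the lag to a direction requires the sampling rate, the microphone spacing, and the speed of sound, none of which are given in the abstract.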
Jwu-Sheng HU Wei-Han LIU Chieh-Cheng CHENG
In ASR (Automatic Speech Recognition) applications, one of the most important issues in the real-time beamforming of microphone arrays is the inability to capture the whole acoustic dynamics via a finite length of data and a finite number of array elements. For example, the reflected source signal impinging from the side-lobe direction presents a coherent interference, and the non-minimum-phase channel dynamics may require an infinite amount of data in order to achieve perfect equalization (or inversion). All these factors appear as uncertainties or un-modeled dynamics in the received signals. Traditional adaptive algorithms such as NLMS that do not consider these errors suffer performance deterioration. In this paper, a time-domain beamformer using an H∞ filtering approach is proposed to adjust the beamforming parameters. Furthermore, this work also proposes a frequency-domain approach called SPFDBB (Soft Penalty Frequency Domain Block Beamformer), also based on H∞ filtering, that can reduce the computational effort and provide purified data to the ASR application. Experimental results show that the adaptive H∞ filtering method is robust to the modeling errors and suppresses much more noise interference than the NLMS-based method. Consequently, the correct rate of ASR is also enhanced.
Jwu-Sheng HU Chieh-Cheng CHENG
This investigation proposes two array beamformers, SPFDBB (Soft Penalty Frequency Domain Block Beamformer) and FDABB (Frequency Domain Adjustable Block Beamformer). Compared with conventional beamformers, these frequency-domain methods can significantly reduce the computational power required in ASR (Automatic Speech Recognition) based applications. Like other reference-signal-based techniques, SPFDBB and FDABB minimize the effects of microphone mismatch, of desired-signal cancellation caused by reflections, and of resolution limits due to the array's position. Additionally, the proposed methods are suitable for both near-field and far-field environments. Generally, the convolution relation between channel and speech source in the time domain cannot be modeled accurately as a multiplication in the frequency domain with a finite window size, especially in ASR applications. SPFDBB and FDABB can approximate this multiplication by treating several frames as a block to achieve a better beamforming result. Moreover, FDABB adjusts the number of frames on-line to cope with variations in the characteristics of both speech and interference signals. Better performance was found to be achievable by combining these methods with an ASR mechanism.
Miki SATO Akihiko SUGIYAMA Osamu HOSHUYAMA Nobuyuki YAMASHITA Yoshihiro FUJITA
This paper proposes near-field sound-source localization based on cross-correlation of a signed binary code. The signed binary code eliminates multibit signal processing for simpler implementation. Explicit formulae under the near-field assumption are derived for a two-microphone scenario and extended to a three-microphone case with front-rear discrimination. An adaptive threshold for enabling and disabling source localization is developed for robustness in noisy environments. The proposed sound-source localization algorithm is implemented on a fixed-point DSP. Evaluation results in a robot scenario demonstrate that the near-field assumption and front-rear discrimination provide almost 40% improvement in DOA estimation. A correct detection rate of 85% is obtained by a robot in a home environment.
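The core idea of correlating one-bit sign codes can be sketched as follows. This is a minimal illustration only: it finds the lag between two channels and does not reproduce the near-field formulae, the front-rear discrimination, or the adaptive threshold of the abstract. The names `sign_code` and `binary_xcorr_lag`, the treatment of zero as +1, and the synthetic test signal are all assumptions made for the example.

```python
import random

def sign_code(x):
    """Map each sample to +1 or -1 (zero is treated as +1, an arbitrary choice)."""
    return [1 if v >= 0 else -1 for v in x]

def binary_xcorr_lag(a, b, max_lag):
    """Lag (in samples) of b relative to a that maximises the sign-code correlation."""
    sa, sb = sign_code(a), sign_code(b)
    best_lag, best_score = 0, None
    for lag in range(-max_lag, max_lag + 1):
        score = 0
        for n in range(len(sa)):
            m = n + lag
            if 0 <= m < len(sb):
                score += sa[n] * sb[m]   # product of two signs: no multibit multiply
        if best_score is None or score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic check: b is a copy of a delayed by 3 samples.
rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(400)]
b = [0.0] * 3 + a[:-3]
estimated_lag = binary_xcorr_lag(a, b, max_lag=8)
```

Because each product is a product of two signs, a fixed-point implementation can replace the multiply with an XOR or XNOR and a counter, which is the kind of simplification the abstract refers to.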
Mitsuharu MATSUMOTO Shuji HASHIMOTO
This paper introduces a multiple signal classification (MUSIC) method that utilizes the transfer characteristics of microphones located at the same place, namely aggregated microphones. The conventional microphone array realizes a sound localization system according to the differences in arrival time, phase shift, and sound level among the microphones. Therefore, it is difficult to miniaturize the microphone array. The objective of our research is to build a reliable miniaturized sound localization system using aggregated microphones. In this paper, we describe a sound system with N microphones. We then show that the microphone array system and the proposed aggregated microphone system can be described in the same framework. We apply multiple signal classification to the method that utilizes the transfer characteristics of the microphones placed at the same location and compare the proposed method with the microphone array. In the proposed method, all microphones are placed at the same place. Hence, it is easy to miniaturize the system. This feature is considered to be useful for practical applications. Experimental results obtained in an ordinary room are shown to verify the validity of the measurement.
A robust microphone array for speech enhancement and noise suppression is studied in this paper. To overcome the target signal cancellation problem of conventional beamformers, caused by array imperfections or reverberation effects of the acoustic enclosure, the proposed microphone array adopts an arbitrary model of the channel transfer function (TF) relating the microphone and the speech source. Since the estimation of the channel TF itself is often intractable, the transfer function ratio (TFR) is estimated instead and used to form a suboptimal beamformer. A TFR estimation method that is robust against stationary or slowly varying noise is proposed based on a signal subspace analysis technique. Experiments using simulated signals and actual signals recorded in a real room illustrate that the proposed method performs well in adverse environments.
Shoji MAKINO Hiroshi SAWADA Ryo MUKAI Shoko ARAKI
This paper overviews a total solution for frequency-domain blind source separation (BSS) of convolutive mixtures of audio signals, especially speech. Frequency-domain BSS performs independent component analysis (ICA) in each frequency bin, and this is more efficient than time-domain BSS. We describe a sophisticated total solution for frequency-domain BSS, including permutation, scaling, circularity, and complex activation function solutions. Experimental results of 2 × 2, 3 × 3, 4 × 4, 6 × 8, and 2 × 2 (moving sources) (#sources × #microphones) in a room are promising.
Tatsunori ASAI Hiroshi SARUWATARI Kiyohiro SHIKANO
This paper describes a new interface for a barge-in free spoken dialogue system combining an adaptive sound field control and a microphone array. In order to actualize robustness against the change of transfer functions due to the various interferences, the barge-in free spoken dialogue system which uses sound field control and a microphone array has been proposed by one of the authors. However, this method cannot follow the change of transfer functions because the method consists of fixed filters. To solve the problem, we introduce a new adaptive sound field control that follows the change of transfer functions.
Yang-Won JUNG Hong-Goo KANG Chungyong LEE Dae-Hee YOUN Changkyu CHOI Jaywoo KIM
In this paper, an adaptive microphone array system with a two-stage adaptation mode controller (AMC) is proposed for high-quality speech acquisition in real environments. The proposed system includes an adaptive array algorithm, a time-delay estimator and a newly proposed AMC. To ensure proper adaptation of the adaptive array algorithm, the proposed AMC uses not only temporal information, but also spatial information. The proposed AMC is constructed with two processing stages: an initialization stage and a running stage. In the initialization stage, a sound source localization technique is adopted, and a signal correlation characteristic is used in the running stage. For the adaptive array algorithm, a generalized sidelobe canceller with an adaptive blocking matrix is used. The proposed algorithm is implemented as a real-time man-machine interface module of a home-agent robot. Simulation results show a 13 dB SINR improvement with the speaker sitting 2 m from the home-agent robot. The speech recognition rate is also enhanced by 32% when compared to the single-channel acquisition system.
Weifeng LI Tetsuya SHINDE Hiroshi FUJIMURA Chiyomi MIYAJIMA Takanori NISHINO Katunobu ITOU Kazuya TAKEDA Fumitada ITAKURA
This paper describes a new multi-channel method of noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on the multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) the method requires neither a sensitive geometric layout, calibration of the sensors, nor additional pre-processing for tracking the speech source; 2) the system requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where close-talking speech is no longer required. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
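The regression step can be illustrated with ordinary least squares. The sketch below assumes one independent linear regression with a bias term per frequency bin, which is a plausible reading of the abstract rather than the paper's confirmed formulation. The array layouts and the names `fit_mrls` and `apply_mrls` are assumptions made for the example.

```python
import numpy as np

def fit_mrls(distant_logspec, close_logspec):
    """Least-squares regression weights, fitted independently for each frequency bin.

    distant_logspec: (frames, mics, bins) log spectra from the distributed microphones
    close_logspec:   (frames, bins) log spectrum at the close-talking microphone
    returns:         (bins, mics + 1) weights; the last column is the bias term
    """
    frames, mics, bins = distant_logspec.shape
    weights = np.zeros((bins, mics + 1))
    for k in range(bins):
        # Design matrix: one column per microphone plus a constant column for the bias.
        A = np.hstack((distant_logspec[:, :, k], np.ones((frames, 1))))
        weights[k], *_ = np.linalg.lstsq(A, close_logspec[:, k], rcond=None)
    return weights

def apply_mrls(distant_logspec, weights):
    """Estimate the close-talking log spectrum from the distant log spectra."""
    est = np.einsum("fmk,km->fk", distant_logspec, weights[:, :-1])
    return est + weights[:, -1]
```

In the training phase `fit_mrls` needs the close-talking recording as the target. In the recognition phase only `apply_mrls` is called, so the close-talking microphone is no longer needed, matching the workflow the abstract describes.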
Satoshi UKAI Tomoya TAKATANI Hiroshi SARUWATARI Kiyohiro SHIKANO Ryo MUKAI Hiroshi SAWADA
In this paper, single-input multiple-output (SIMO)-model-based blind source separation (BSS) is addressed, where unknown mixed source signals are detected at microphones, and can be separated, not into monaural source signals but into SIMO-model-based signals from independent sources as they are at the microphones. This technique is highly applicable to high-fidelity signal processing such as binaural signal processing. First, we provide an experimental comparison between two kinds of SIMO-model-based BSS methods, namely, conventional frequency-domain ICA with projection-back processing (FDICA-PB), and SIMO-ICA which was recently proposed by the authors. Secondly, we propose a new combination technique of the FDICA-PB and SIMO-ICA, which can achieve a higher separation performance than the two methods. The experimental results reveal that the accuracy of the separated SIMO signals in the simple SIMO-ICA is inferior to that of the signals obtained by FDICA-PB under low-quality initial value conditions, but the proposed combination technique can outperform both simple FDICA-PB and SIMO-ICA.
In this report, we propose an algorithm for tracking speaker direction using microphones located at the vertices of an equilateral triangle. The method realizes tracking by minimizing a performance index that consists of the cross spectra of three different microphone pairs in the triangular array. We adopt the steepest descent method to minimize it, and to guarantee global convergence to the correct direction with high accuracy, we alter the performance index during the adaptation depending on the convergence state. Through computer simulations and experiments in a real acoustic environment, we show the effectiveness of the proposed method.
Toshiharu HORIUCHI Mitsunori MIZUMACHI Satoshi NAKAMURA
This paper proposes a simple method for estimating and compensating signal direction, to deal with relative changes in sound source location caused by the movements of a microphone array and a sound source. This method introduces a delay filter that has shifted and sampled sinc functions. This paper presents a concept for the joint optimization of arrival time differences and of the coordinate system of a mobile microphone array. We use the LMS algorithm to derive this method by maintaining a certain relationship between the directions of the microphone array and the sound source directions. This method directly estimates the relative directions of the microphone array to the sound source directions by minimizing the relative differences of arrival time among the observed signals, not by estimating the time difference of arrival (TDOA) between two observed signals. This method also compensates the time delay of the observed signals simultaneously, so that the output signals remain in phase. Simulation results support the effectiveness of the method.
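A delay filter built from a shifted, sampled sinc function can be sketched as below. This shows only the fractional-delay building block, assuming a truncated and unwindowed sinc. The LMS-based direction estimation of the abstract is not reproduced, and the names `sinc_delay_filter` and `apply_delay` and the default `half_len` are assumptions made for the example.

```python
import numpy as np

def sinc_delay_filter(delay, half_len=16):
    """FIR taps h[n] = sinc(n - delay) for n = -half_len..half_len (truncated, unwindowed)."""
    n = np.arange(-half_len, half_len + 1)
    return np.sinc(n - delay)              # np.sinc(x) = sin(pi*x) / (pi*x)

def apply_delay(x, delay, half_len=16):
    """Delay x by a possibly fractional number of samples (|delay| <= half_len)."""
    h = sinc_delay_filter(delay, half_len)
    y = np.convolve(x, h)                  # full convolution, length len(x) + 2*half_len
    return y[half_len:half_len + len(x)]   # undo the offset caused by the centred taps
```

For an integer delay the taps reduce to a shifted unit impulse, so the filter is an exact sample shift. For fractional delays the truncation introduces a small ripple, which a window function or a longer filter would reduce. Because the taps depend smoothly on `delay`, the delay can be adapted by a gradient method such as LMS.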
Tomoya TAKATANI Tsuyoki NISHIKAWA Hiroshi SARUWATARI Kiyohiro SHIKANO
We propose a novel blind separation framework for Single-Input Multiple-Output (SIMO)-model-based acoustic signals using an extended ICA algorithm, SIMO-ICA. The SIMO-ICA consists of multiple ICAs and a fidelity controller, and each ICA runs in parallel under the fidelity control of the entire separation system. The SIMO-ICA can separate the mixed signals, not into monaural source signals but into SIMO-model-based signals from independent sources as they are at the microphones. Thus, the separated signals of SIMO-ICA can maintain the spatial qualities of each sound source. In order to evaluate its effectiveness, separation experiments are carried out under both nonreverberant and reverberant conditions. The experimental results reveal that the signal separation performance of the proposed SIMO-ICA is the same as that of the conventional ICA-based method, and that the spatial quality of the separated sound in SIMO-ICA is remarkably superior to that of the conventional method, particularly for the fidelity of the sound reproduction.
Tsuyoki NISHIKAWA Hiroshi ABE Hiroshi SARUWATARI Kiyohiro SHIKANO Atsunobu KAMINUMA
We propose a new algorithm for overdetermined blind source separation (BSS) based on multistage independent component analysis (MSICA). To improve the separation performance, we have proposed MSICA in which frequency-domain ICA and time-domain ICA are cascaded. In the original MSICA, the specific mixing model, where the number of microphones is equal to that of sources, was assumed. However, additional microphones are required to achieve an improved separation performance under reverberant environments. This leads to alternative problems, e.g., a complication of the permutation problem. In order to solve them, we propose a new extended MSICA using subarray processing, where the number of microphones and that of sources are set to be the same in every subarray. The experimental results obtained under the real environment reveal that the separation performance of the proposed MSICA is improved as the number of microphones is increased.
Hongseok KWON Jongmok SON Keunsung BAE
This paper describes a new speech enhancement system that employs a microphone array with post-processing based on a minimum mean-square error short-time spectral amplitude (MMSE-STSA) estimator. To obtain a more accurate MMSE-STSA estimator in a microphone array, modification and refinement procedures are carried out on each microphone output. The performance of the proposed system is compared with that of other methods using a microphone array. Noise removal experiments for white and pink noise demonstrate the superiority of the proposed speech enhancement system over other microphone-array methods in average output SNR and cepstral distance measures.
Osamu ICHIKAWA Tetsuya TAKIGUCHI Masafumi NISHIMURA
In a two-microphone approach, interchannel differences in time (ICTD) and interchannel differences in sound level (ICLD) have generally been used for sound source localization. But those cues are not effective for vertical localization in the median plane (direct front). For that purpose, spectral cues based on features of head-related transfer functions (HRTF) have been investigated, but they are not robust enough against signal variations and environmental noise. In this paper, we use a "profile" as a cue while using a combination of reflectors specially designed for vertical localization. The observed sound is converted into a profile containing information about reflections as well as ICTD and ICLD data. The observed profile is decomposed into signal and noise by using template profiles associated with sound source locations. The template minimizing the residual of the decomposition gives the estimated sound source location. Experiments show this method can correctly provide a rough estimate of the vertical location even in a noisy environment.